Boosting Unsupervised Grammar Induction by Splitting Complex Sentences on Function Words
Abstract
The statistical-structural algorithm for unsupervised language acquisition, ADIOS (for Automatic DIstillation Of Structure), developed by Solan et al. (2005), has been shown to be capable of learning precise and productive grammars from realistic, raw, unannotated corpus data, including transcribed child-directed speech from the CHILDES corpora, in languages as diverse as English and Mandarin. This algorithm, however, does not deal well with grammatically complex texts: the patterns it detects in complex sentences often combine parts of different clauses, which hurts performance. We address this problem with a two-stage learning technique. First, complex sentences are split into simple ones around function words, and the resulting corpus is used to train an ADIOS learner. Second, the original complex corpus is used to complete the training. We also show how the function words themselves can be learned from the corpus using unsupervised distributional clustering.
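The splitting step and the two-stage ordering can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function-word list is hand-picked (the paper learns it by distributional clustering), the `min_len` threshold is hypothetical, and keeping each function word at the start of the fragment it introduces is only one possible reading of "split around function words". The ADIOS learner itself is not shown.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical, hand-picked function words (an assumption for this sketch;
# the paper learns them from the corpus by distributional clustering).
FUNCTION_WORDS: Set[str] = {"that", "because", "and", "but", "when", "which"}


def split_on_function_words(
    sentence: str,
    function_words: Set[str] = FUNCTION_WORDS,
    min_len: int = 2,
) -> List[str]:
    """Split a whitespace-tokenized sentence into simpler fragments.

    A new fragment starts at each function word, provided the fragment
    collected so far has at least `min_len` tokens. The function word is
    kept as the first token of the new fragment (an assumption).
    """
    fragments: List[str] = []
    current: List[str] = []
    for token in sentence.split():
        if token.lower() in function_words and len(current) >= min_len:
            fragments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        fragments.append(" ".join(current))
    return fragments


def two_stage_corpus(corpus: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (stage-one corpus of split fragments, stage-two original corpus)."""
    stage_one = [frag for s in corpus for frag in split_on_function_words(s)]
    stage_two = list(corpus)
    return stage_one, stage_two


if __name__ == "__main__":
    simple, original = two_stage_corpus(["the dog barked because the cat ran"])
    print(simple)    # ['the dog barked', 'because the cat ran']
    print(original)  # ['the dog barked because the cat ran']
```

A learner would be trained on the first list and then continue training on the second. How the real system hands state from one stage to the next is not specified in the abstract.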
Similar resources
Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models
Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing interpunctuation fragments as atoms, in...
Unsupervised language acquisition: syntax from plain corpus
We describe results of a novel algorithm for grammar induction from a large corpus. The ADIOS (Automatic DIstillation of Structure) algorithm searches for significant patterns, chosen according to context dependent statistical criteria, and builds a hierarchy of such patterns according to a set of rules leading to structured generalization. The corpus is thus generalized into a context free gra...
Identifying Patterns for Unsupervised Grammar Induction
This paper describes a new method for unsupervised grammar induction based on the automatic extraction of certain patterns in the texts. Our starting hypothesis is that there exist some classes of words that function as separators, marking the beginning or the end of new constituents. Among these separators we distinguish those which trigger new levels in the parse tree. If we are able to detec...
Three Dependency-and-Boundary Models for Grammar Induction
We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries — such as English dete...
Unsupervised Dependency Parsing without Gold Part-of-Speech Tags
We show that categories induced by unsupervised word clustering can surpass the performance of gold part-of-speech tags in dependency grammar induction. Unlike classic clustering algorithms, our method allows a word to have different tags in different contexts. In an ablative analysis, we first demonstrate that this context-dependence is crucial to the superior performance of gold tags — requir...
Publication date: 2007